Wine Quality Prediction
WINE!!!
Through this project we embark on an exhilarating journey through the world of wine, where we explore the delicate art of predicting wine quality. With each sip and swirl, we delve into the enchanting realm of data analysis and visualization, uncovering the hidden gems within the chemical attributes that define wine excellence.
LOADING THE DATASETS
| type | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| white | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.001 | 3.00 | 0.45 | 8.8 | 6 |
| white | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.994 | 3.30 | 0.49 | 9.5 | 6 |
| white | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.995 | 3.26 | 0.44 | 10.1 | 6 |
| white | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.996 | 3.19 | 0.40 | 9.9 | 6 |
| white | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.996 | 3.19 | 0.40 | 9.9 | 6 |
| type | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6493 | red | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32 | 44 | 0.995 | 3.45 | 0.58 | 10.5 | 5 |
| 6494 | red | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39 | 51 | 0.995 | 3.52 | NA | 11.2 | 6 |
| 6495 | red | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29 | 40 | 0.996 | 3.42 | 0.75 | 11.0 | 6 |
| 6496 | red | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32 | 44 | 0.996 | 3.57 | 0.71 | 10.2 | 5 |
| 6497 | red | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18 | 42 | 0.996 | 3.39 | 0.66 | 11.0 | 6 |
We have succesfully loaded the dataset. We can see that there are 6497 observations and 13 variables.
Next lets look at the summary statistics.
SUMMARY STATISTICS
| type | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:6497 | Min. : 3.80 | Min. :0.08 | Min. :0.000 | Min. : 0.6 | Min. :0.009 | Min. : 1.0 | Min. : 6 | Min. :0.987 | Min. :2.72 | Min. :0.22 | Min. : 8.0 | Min. :3.00 |
| Q1 | Class :character | 1st Qu.: 6.40 | 1st Qu.:0.23 | 1st Qu.:0.250 | 1st Qu.: 1.8 | 1st Qu.:0.038 | 1st Qu.: 17.0 | 1st Qu.: 77 | 1st Qu.:0.992 | 1st Qu.:3.11 | 1st Qu.:0.43 | 1st Qu.: 9.5 | 1st Qu.:5.00 |
| Median | Mode :character | Median : 7.00 | Median :0.29 | Median :0.310 | Median : 3.0 | Median :0.047 | Median : 29.0 | Median :118 | Median :0.995 | Median :3.21 | Median :0.51 | Median :10.3 | Median :6.00 |
| Mean | NA | Mean : 7.22 | Mean :0.34 | Mean :0.319 | Mean : 5.4 | Mean :0.056 | Mean : 30.5 | Mean :116 | Mean :0.995 | Mean :3.22 | Mean :0.53 | Mean :10.5 | Mean :5.82 |
| Q3 | NA | 3rd Qu.: 7.70 | 3rd Qu.:0.40 | 3rd Qu.:0.390 | 3rd Qu.: 8.1 | 3rd Qu.:0.065 | 3rd Qu.: 41.0 | 3rd Qu.:156 | 3rd Qu.:0.997 | 3rd Qu.:3.32 | 3rd Qu.:0.60 | 3rd Qu.:11.3 | 3rd Qu.:6.00 |
| Max | NA | Max. :15.90 | Max. :1.58 | Max. :1.660 | Max. :65.8 | Max. :0.611 | Max. :289.0 | Max. :440 | Max. :1.039 | Max. :4.01 | Max. :2.00 | Max. :14.9 | Max. :9.00 |
| NA | NA | NA’s :10 | NA’s :8 | NA’s :3 | NA’s :2 | NA’s :2 | NA | NA | NA | NA’s :9 | NA’s :4 | NA | NA |
A quick look at the summary tells us the inter-quartile ranges and maximum and minimum values for each variable.
Observations:
There are also some NA values which we need to remove.
There might be duplicates in the dataset.
CLEANING THE DATA
Removing Duplicates| type | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:5329 | Min. : 3.80 | Min. :0.08 | Min. :0.000 | Min. : 0.6 | Min. :0.009 | Min. : 1.0 | Min. : 6 | Min. :0.987 | Min. :2.72 | Min. :0.22 | Min. : 8.0 | Min. :3.0 |
| Q1 | Class :character | 1st Qu.: 6.40 | 1st Qu.:0.23 | 1st Qu.:0.240 | 1st Qu.: 1.8 | 1st Qu.:0.038 | 1st Qu.: 16.0 | 1st Qu.: 75 | 1st Qu.:0.992 | 1st Qu.:3.11 | 1st Qu.:0.43 | 1st Qu.: 9.5 | 1st Qu.:5.0 |
| Median | Mode :character | Median : 7.00 | Median :0.30 | Median :0.310 | Median : 2.7 | Median :0.047 | Median : 28.0 | Median :116 | Median :0.995 | Median :3.21 | Median :0.51 | Median :10.4 | Median :6.0 |
| Mean | NA | Mean : 7.22 | Mean :0.34 | Mean :0.319 | Mean : 5.1 | Mean :0.057 | Mean : 30.1 | Mean :114 | Mean :0.995 | Mean :3.22 | Mean :0.53 | Mean :10.6 | Mean :5.8 |
| Q3 | NA | 3rd Qu.: 7.70 | 3rd Qu.:0.41 | 3rd Qu.:0.400 | 3rd Qu.: 7.5 | 3rd Qu.:0.066 | 3rd Qu.: 41.0 | 3rd Qu.:154 | 3rd Qu.:0.997 | 3rd Qu.:3.33 | 3rd Qu.:0.60 | 3rd Qu.:11.4 | 3rd Qu.:6.0 |
| Max | NA | Max. :15.90 | Max. :1.58 | Max. :1.660 | Max. :65.8 | Max. :0.611 | Max. :289.0 | Max. :440 | Max. :1.039 | Max. :4.01 | Max. :2.00 | Max. :14.9 | Max. :9.0 |
| NA | NA | NA’s :10 | NA’s :8 | NA’s :3 | NA’s :2 | NA’s :2 | NA | NA | NA | NA’s :9 | NA’s :4 | NA | NA |
Duplicates can cause issues later with biases. Thus, we have removed the duplicate values.
Now, the dataset has 5329 observations.
Removing NA Values| type | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:5295 | Min. : 3.80 | Min. :0.080 | Min. :0.000 | Min. : 0.6 | Min. :0.009 | Min. : 1 | Min. : 6 | Min. :0.987 | Min. :2.72 | Min. :0.220 | Min. : 8.0 | Min. :3.0 |
| Q1 | Class :character | 1st Qu.: 6.40 | 1st Qu.:0.230 | 1st Qu.:0.240 | 1st Qu.: 1.8 | 1st Qu.:0.038 | 1st Qu.: 16 | 1st Qu.: 74 | 1st Qu.:0.992 | 1st Qu.:3.11 | 1st Qu.:0.430 | 1st Qu.: 9.5 | 1st Qu.:5.0 |
| Median | Mode :character | Median : 7.00 | Median :0.300 | Median :0.310 | Median : 2.7 | Median :0.047 | Median : 28 | Median :116 | Median :0.995 | Median :3.21 | Median :0.510 | Median :10.4 | Median :6.0 |
| Mean | NA | Mean : 7.22 | Mean :0.344 | Mean :0.319 | Mean : 5.1 | Mean :0.057 | Mean : 30 | Mean :114 | Mean :0.995 | Mean :3.22 | Mean :0.533 | Mean :10.6 | Mean :5.8 |
| Q3 | NA | 3rd Qu.: 7.70 | 3rd Qu.:0.410 | 3rd Qu.:0.400 | 3rd Qu.: 7.5 | 3rd Qu.:0.066 | 3rd Qu.: 41 | 3rd Qu.:154 | 3rd Qu.:0.997 | 3rd Qu.:3.33 | 3rd Qu.:0.600 | 3rd Qu.:11.4 | 3rd Qu.:6.0 |
| Max | NA | Max. :15.90 | Max. :1.580 | Max. :1.660 | Max. :65.8 | Max. :0.611 | Max. :289 | Max. :440 | Max. :1.039 | Max. :4.01 | Max. :2.000 | Max. :14.9 | Max. :9.0 |
After, removing the NAs, we are finally left with 5295 observations. The dataset is now clean.
Now, the data is clean. We have removed the NAs, and duplicates. We can now start our exploratory data analysis.
Observations:
- There is a big difference between maximum value and Q3 values. This means that there are a quite a lot of outliers.
Let’s visualize this using plots.
UNIVARIATE PLOTS
To understand in depth as to which factor effects the quality the most.
Let’s look at the individual variables in the data set.
Observations:
Wine quality shows a rather symmetrical distribution.
Most wines have a quality score of 6.
No wine achieved the highest score of 10 and the worst wines got a rating of 3.
Let’s see how the other factors are effecting!
Observations:
- Most distributions encountered during the exploration of the parameters looked rather usual. In general, they were positively skewed with a narrow main peak.
We will also have a look at at the box plots for these!
p1 <- ggplot(data = wine, aes(x = "", y = fixed.acidity )) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Fixed Acidity",
y = "TaOH Concentration [g/L]"
)
p2 <- ggplot(data = wine, aes(x = "", y = volatile.acidity)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Volatile Acidity",
y = "AcOH Concentration [g/L]"
)
p3 <- ggplot(data = wine, aes(x = "", y = citric.acid)) +
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "Citric Acidity",
y = "Concentration [g/L]"
)
p4 <- ggplot(data = wine, aes(x = "", y = pH))+
geom_boxplot(color = 'black', fill = I('white')) +
labs(
x = "pH",
y = "pH"
)
grid.arrange(p1,p2, p3,p4, nrow = 1)Observations:
Residual sugar has a very long-tail distribution with many outliers. It will be interesting to see how these outliers affect the quality of wine.
Chlorides have distribution similar to residual sugar and have a strong concentration around the median. We also note a lot of outliers from the box plot.
Most wines have less than 11% alcohol.
Density has a very normal looking distribution with most of the values falling between 0.995 and 1.
SUMMARY OF UNIVARIATE PLOTS
In general, the variables were positively skewed with a narrow main peak.
Most wines have a pH of 3.2. Since we have chlorides,citric acid, and fixed and volatile acidity, the wines were bound to be on the acidic side.
The wines have an alcohol content ranging between 8 and 15 vol%.
Now let us see that how different factors are related to the quality!
CORRELATION MATRIX
First, we will build correlation matrix to identify the variables
which influence quality the most.
Observations:
Our Target Variable is Quality, so we will focus on only those parameters which influence quality.
Alcohol and quality have a high positive correlation.
Density and quality have a negative correlation.
Low correlation between quality and chloride concentration.
Wine quality also slightly negatively correlates with volatile acidity.
It has a slight positive correlation with citric acid.
From the correlation matrix, let us now see how these parameters vary over different quality ratings for red and white wine separately and find out how they are different.
SUBSETTING DATA INTO RED AND WHITE WINEs
We will now subset the data into white and red wine separately to go ahead with our analysis.
BIVARIATE PLOTS
Through our correlation plot we have understood that Alcohol content, density, citric acid and chloride are the ones that are affecting the quality the most, let us see how and also make a comparitive analysis between red and white wine individually!
1. ALCOHOL VS QUALITY
The boxplot shows that wines with higher quality seem to have a higher alcohol content.
T-Test
WELCH TWO SAMPLE T-TEST
NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in alcohol content.
ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the two wines.
Observations:
- Since the p-value of 3.606^{-6} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol concentration between the two wines.
Let us check how it varies for red and white wine.
Observations:
White Wines have higher alcohol content.
Alcohol has a strong positive correlation with quality.
ANOVA TEST FOR RED WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in red wine
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean
alcohol content across quality categories in red wine Df Sum Sq Mean Sq
F value Pr(>F)
as.factor(quality) 5 438 87.6 103 <2e-16 *** Residuals 1347 1144
0.8
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1 Length Class Mode
as.factor(quality) 60 -none- numeric
Observations:
From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol content across all categories of red wine.
We also have done Tukey Test to check in which quality ratings, there is a significant difference in alcohol level.
ANOVA TEST FOR WHITE WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean alcohol content across quality categories in white wine
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean alcohol content across quality categories in white wine
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(quality) 6 1467 244.5 220 <2e-16 ***
## Residuals 3935 4377 1.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = alcohol ~ as.factor(quality), data = white)
##
## $`as.factor(quality)`
## diff lwr upr p adj
## 4-3 -0.147 -0.8867 0.5934 0.997
## 5-3 -0.479 -1.1809 0.2223 0.405
## 6-3 0.304 -0.3957 1.0035 0.861
## 7-3 1.175 0.4692 1.8808 0.000
## 8-3 1.535 0.7882 2.2821 0.000
## 9-3 1.835 0.2794 3.3906 0.009
## 5-4 -0.333 -0.6009 -0.0644 0.005
## 6-4 0.451 0.1876 0.7134 0.000
## 7-4 1.322 1.0427 1.6006 0.000
## 8-4 1.682 1.3109 2.0527 0.000
## 9-4 1.982 0.5675 3.3957 0.001
## 6-5 0.783 0.6660 0.9003 0.000
## 7-5 1.654 1.5046 1.8040 0.000
## 8-5 2.014 1.7278 2.3011 0.000
## 9-5 2.314 0.9199 3.7087 0.000
## 7-6 0.871 0.7313 1.0111 0.000
## 8-6 1.231 0.9496 1.5130 0.000
## 9-6 1.531 0.1378 2.9245 0.020
## 8-7 0.360 0.0634 0.6568 0.006
## 9-7 0.660 -0.7365 2.0564 0.805
## 9-8 0.300 -1.1179 1.7176 0.996
Observations:
From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean alcohol content across all categories of red wine.
We also have done Tukey Test to check in which quality ratings, there is a significant difference in alcohol level.
2. DENSITY VS QUALITY
The boxplot shows that wines with higher quality seem to have a less denser.
T-Test
WELCH TWO SAMPLE T-TEST
NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in density level.
ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in density level. [1] 6.92e-322
Observations:
- Since the p-value of 7^{-322} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean densitylevel between the two wines.
Let us check how it varies for red and white wine
Observations:
Red Wines are more dense than white wines.
Density has a negative correlation with quality.
ANOVA TEST FOR RED WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in red wine.
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in red wine.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(quality) 5 0.00021 4.27e-05 12.8 4e-12 ***
## Residuals 1347 0.00451 3.40e-06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = density ~ as.factor(quality), data = red, conf.level = 0.95)
##
## $`as.factor(quality)`
## diff lwr upr p adj
## 4-3 -0.000926 -0.002730 8.77e-04 0.686
## 5-3 -0.000380 -0.002047 1.29e-03 0.987
## 6-3 -0.000886 -0.002553 7.82e-04 0.654
## 7-3 -0.001413 -0.003114 2.88e-04 0.167
## 8-3 -0.002369 -0.004451 -2.87e-04 0.015
## 5-4 0.000546 -0.000210 1.30e-03 0.309
## 6-4 0.000041 -0.000718 8.00e-04 1.000
## 7-4 -0.000486 -0.001316 3.43e-04 0.550
## 8-4 -0.001442 -0.002902 1.73e-05 0.055
## 6-5 -0.000505 -0.000819 -1.91e-04 0.000
## 7-5 -0.001033 -0.001492 -5.73e-04 0.000
## 8-5 -0.001988 -0.003274 -7.03e-04 0.000
## 7-6 -0.000527 -0.000991 -6.39e-05 0.015
## 8-6 -0.001483 -0.002770 -1.96e-04 0.013
## 8-7 -0.000956 -0.002286 3.74e-04 0.314
Observations:
From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean density across all categories of red wine.
We also have done Tukey Test to check in which quality ratings, there is a significant difference in density level.
ANOVA TEST FOR WHITE WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean density level across quality categories in white wine.
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean density level across quality categories in white wine.
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality) 6 0.00458 0.000764 105 <2e-16 *** Residuals
3935 0.02872 0.000007
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise
confidence level
Fit: aov(formula = density ~ as.factor(quality), data = white)
$as.factor(quality) diff lwr upr p adj 4-3 -0.000695
-0.00259 1.20e-03 0.934 5-3 0.000182 -0.00161 1.98e-03 1.000 6-3
-0.001160 -0.00295 6.33e-04 0.475 7-3 -0.002824 -0.00463 -1.02e-03 0.000
8-3 -0.003139 -0.00505 -1.23e-03 0.000 9-3 -0.003424 -0.00741 5.61e-04
0.147 5-4 0.000877 0.00019 1.56e-03 0.003 6-4 -0.000465 -0.00114
2.09e-04 0.392 7-4 -0.002129 -0.00284 -1.41e-03 0.000 8-4 -0.002444
-0.00339 -1.49e-03 0.000 9-4 -0.002729 -0.00635 8.93e-04 0.283 6-5
-0.001342 -0.00164 -1.04e-03 0.000 7-5 -0.003006 -0.00339 -2.62e-03
0.000 8-5 -0.003321 -0.00406 -2.59e-03 0.000 9-5 -0.003606 -0.00718
-3.44e-05 0.046 7-6 -0.001664 -0.00202 -1.31e-03 0.000 8-6 -0.001979
-0.00270 -1.26e-03 0.000 9-6 -0.002264 -0.00583 1.30e-03 0.500 8-7
-0.000315 -0.00108 4.45e-04 0.885 9-7 -0.000600 -0.00418 2.98e-03 0.999
9-8 -0.000285 -0.00392 3.35e-03 1.000
Observations:
From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean density level across all categories of red wine.
We also have done Tukey Test to check in which quality ratings, there is a significant difference in density level.
3. CHLORIDES VS QUALITY
The boxplot shows that wines with higher quality seem to have a less chlorides.
T-Test
WELCH TWO SAMPLE T-TEST
NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in chloride concentration.
ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in chloride concentration. [1] 3.71e-159
Observations:s
- Since the p-value of 3.706^{-159} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration between the two wines.
Let us check how it varies for red and white wine
Observations:
Red Wines have more chloride concentration than white wines.
Chloride Concentration has a slight negative correlation with quality
ANOVA TEST FOR RED WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean chloride contentration across quality categories in red wine.
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride contentration across quality categories in red wine.
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality) 5 0.06 0.01286 5.34 7.2e-05 *** Residuals 1347
3.24 0.00241
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise
confidence level
Fit: aov(formula = chlorides ~ as.factor(quality), data = red, conf.level = 0.95)
$as.factor(quality) diff lwr upr p adj 4-3 -0.03244
-0.0808 1.59e-02 0.393 5-3 -0.02851 -0.0732 1.62e-02 0.452 6-3 -0.03730
-0.0820 7.40e-03 0.164 7-3 -0.04567 -0.0913 -7.67e-05 0.049 8-3 -0.05415
-0.1100 1.66e-03 0.063 5-4 0.00394 -0.0163 2.42e-02 0.994 6-4 -0.00485
-0.0252 1.55e-02 0.984 7-4 -0.01323 -0.0355 9.01e-03 0.534 8-4 -0.02170
-0.0608 1.74e-02 0.610 6-5 -0.00879 -0.0172 -3.65e-04 0.035 7-5 -0.01716
-0.0295 -4.85e-03 0.001 8-5 -0.02564 -0.0601 8.82e-03 0.276 7-6 -0.00837
-0.0208 4.05e-03 0.389 8-6 -0.01685 -0.0514 1.77e-02 0.731 8-7 -0.00848
-0.0441 2.72e-02 0.984
Observations
From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration across all categories of red wine.
We also have done Tukey Test to check in which quality ratings, there is a significant differnece in chloride concentration.
ANOVA TEST FOR WHITE WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean chloride contentration across quality categories in white wine.
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean chloride contentration across quality categories in white wine.
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality) 6 0.114 0.0190 37.7 <2e-16 *** Residuals 3935
1.987 0.0005
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise
confidence level
Fit: aov(formula = chlorides ~ as.factor(quality), data = white)
$as.factor(quality) diff lwr upr p adj 4-3 -0.004162
-0.01993 0.011603 0.987 5-3 -0.001961 -0.01691 0.012985 1.000 6-3
-0.009171 -0.02407 0.005733 0.538 7-3 -0.016760 -0.03180 -0.001725 0.018
8-3 -0.017544 -0.03346 -0.001633 0.020 9-3 -0.026900 -0.06004 0.006239
0.201 5-4 0.002201 -0.00351 0.007915 0.917 6-4 -0.005009 -0.01061
0.000592 0.115 7-4 -0.012598 -0.01854 -0.006655 0.000 8-4 -0.013382
-0.02128 -0.005481 0.000 9-4 -0.022738 -0.05286 0.007386 0.281 6-5
-0.007210 -0.00970 -0.004714 0.000 7-5 -0.014799 -0.01799 -0.011609
0.000 8-5 -0.015583 -0.02169 -0.009476 0.000 9-5 -0.024939 -0.05464
0.004765 0.168 7-6 -0.007589 -0.01057 -0.004609 0.000 8-6 -0.008373
-0.01437 -0.002373 0.001 9-6 -0.017729 -0.04741 0.011953 0.574 8-7
-0.000784 -0.00710 0.005536 1.000 9-7 -0.010140 -0.03989 0.019609 0.953
9-8 -0.009356 -0.03956 0.020846 0.970
Observations:
From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean chloride concentration across all categories of red wine.
We also have done Tukey Test to check in which quality ratings, there is a significant difference in chloride concentration.
4. CITRIC ACID VS QUALITY
The boxplot shows that wines with higher quality seem to have a high citric acid.
T-Test
WELCH TWO SAMPLE T-TEST
NULL HYPOTHESIS (H0): There is no significant mean difference between red and white wine in citric acid concentration.
ALTERNATE HYPOTHESIS (H1): There is significant mean difference between the tw0 wines in citric acid concentration. [1] 1.33e-26
Observations:
- Since the p-value of 1.335^{-26} is less than 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid concentration between the two wines.
Let us check how it varies for red and white wine
Observations:
White Wines have more citric acid concentration than red wines.
Citric ACid Concentration has a slight positive correlation with quality.
There isn’t much difference in citric acid concentration in white wines across the quality ratings.
ANOVA TEST FOR RED WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in red wine.
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean
citric acid level across quality categories in red wine. Df Sum Sq Mean
Sq F value Pr(>F)
as.factor(quality) 5 2.9 0.584 16.1 1.8e-15 *** Residuals 1347 48.8
0.036
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise
confidence level
Fit: aov(formula = citric.acid ~ as.factor(quality), data = red, conf.level = 0.95)
$as.factor(quality) diff lwr upr p adj 4-3 0.00477
-0.18279 0.1923 1.000 5-3 0.07377 -0.09948 0.2470 0.830 6-3 0.10949
-0.06389 0.2829 0.465 7-3 0.20086 0.02402 0.3777 0.015 8-3 0.21194
-0.00453 0.4284 0.059 5-4 0.06901 -0.00965 0.1477 0.124 6-4 0.10472
0.02579 0.1836 0.002 7-4 0.19609 0.10983 0.2823 0.000 8-4 0.20717
0.05542 0.3589 0.001 6-5 0.03572 0.00304 0.0684 0.023 7-5 0.12708
0.07934 0.1748 0.000 8-5 0.13817 0.00450 0.2718 0.038 7-6 0.09137
0.04318 0.1396 0.000 8-6 0.10245 -0.03138 0.2363 0.246 8-7 0.01108
-0.12720 0.1494 1.000
Observations:
From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid across all categories of red wine.
We also have done Tukey Test to check in which quality ratings, there is a significant difference in citric acid level.
ANOVA TEST FOR WHITE WINE
NULL HYPOTHESIS (H0): There is no significant difference in mean citric acid level across quality categories in white wine.
ALTERNATE HYPOTHESIS (H1): There is significant difference in mean citric acid level across quality categories in white wine.
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(quality) 6 0.2 0.0363 2.43 0.024 * Residuals 3935 58.8
0.0150
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05
‘.’ 0.1 ’ ’ 1 Tukey multiple comparisons of means 95% family-wise
confidence level
Fit: aov(formula = citric.acid ~ as.factor(quality), data = white)
$as.factor(quality) diff lwr upr p adj 4-3 -0.031789
-0.117583 0.05400 0.930 5-3 0.000279 -0.081059 0.08162 1.000 6-3
0.002416 -0.078687 0.08352 1.000 7-3 -0.008569 -0.090389 0.07325 1.000
8-3 -0.000962 -0.087551 0.08563 1.000 9-3 0.050000 -0.130341 0.23034
0.983 5-4 0.032068 0.000969 0.06317 0.038 6-4 0.034205 0.003727 0.06468
0.016 7-4 0.023220 -0.009118 0.05556 0.342 8-4 0.030828 -0.012172
0.07383 0.344 9-4 0.081789 -0.082144 0.24572 0.762 6-5 0.002137
-0.011441 0.01572 0.999 7-5 -0.008848 -0.026203 0.00851 0.743 8-5
-0.001241 -0.034472 0.03199 1.000 9-5 0.049721 -0.111925 0.21137 0.972
7-6 -0.010985 -0.027202 0.00523 0.416 8-6 -0.003378 -0.036030 0.02927
1.000 9-6 0.047584 -0.113944 0.20911 0.977 8-7 0.007608 -0.026787
0.04200 0.995 9-7 0.058569 -0.103320 0.22046 0.938 9-8 0.050962
-0.113390 0.21531 0.970
Observations:
- From the ANOVA test, we can see that the P-vale is significantly less than 0.05. So, we reject the null hypothesis and conclude that there is a significant difference in mean citric acid level across all categories of red wine.
CORRELATION FOR THE BI-VARIATE PLOTS.
Observations:
Data cleaning is performed on both white and red in the above code, deleting rows with missing values in certain columns. Correlation matrices are subsequently computed for the cleaned datasets. The correlation matrices from the white and red wine datasets are averaged in the combined correlation matrix. A correlation plot with a customized appearance is built.
SUMMARY OF BIVARIATE PLOTS
From the bivariate plots, we have concluded that :
For good quality wines, the alcohol content is more.
As the density decreases, the quality gets better.
As the chloride concentration decreases, quality gets better.
For better wines, the citric acid concentration is more.
Whites wines generally have less density and have more alcohol conent.
MULTIVARIATE PLOTS
For the last part of our EDA, we will perform some multivariate plots to see some how the other non-important featues in wine are distributed in red and white wine.
Observations:
White Wines have more sugar concentration than red wines. This might explain why white wines are usually more sweet.
Red Wines have more sulphate concentration.
LIMITATIONS
Although we thoroughly believe in our analysis, we have to mention a few anomalies that are present that may or may not have influenced the results.
The number of data that we have on white wine is comparatively more than that of red wine.
We have also observed that most of the data present are in the average quality range i.e, from 4-8.
For a better analysis, we need a more balanced data.
CONCLUSIONS FROM EDA
In our preliminary analysis, we have uncovered some key insights:
Wines with elevated alcohol content, increased citric acid levels, lower density, and reduced chlorides tend to exhibit higher quality.
It appears that white wines, in general, tend to be sweeter and have higher alcohol content when compared to their red counterparts.
With this new found knowledge, you’re better equipped to make informed choices when selecting wines!!
MODELLING THE DATA
From our EDA, we have identified some key features which influence the target variable- Quality.
As part of our next analysis, we will now perform some regression models to see which model predicts the quality of wine with most accuracy.
1. LINEAR REGRESSION
loadPkg("leaps")
reg.best2 <- regsubsets(quality~., data = numeric_data, nbest = 2, method = "exhaustive")
plot(reg.best2, scale = "adjr2", main = "Adjusted R^2")model1 = lm(quality ~ volatile.acidity+residual.sugar+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol, data=numeric_data)
sum_md1 = summary(model1) # also for later use on inline codes
sum_md1##
## Call:
## lm(formula = quality ~ volatile.acidity + residual.sugar + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = numeric_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.992 -0.450 -0.024 0.462 3.144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.391874 0.240127 5.80 7.2e-09 ***
## volatile.acidity -1.390559 0.068586 -20.27 < 2e-16 ***
## residual.sugar 0.017460 0.002675 6.53 7.3e-11 ***
## free.sulfur.dioxide 0.006692 0.000825 8.11 6.4e-16 ***
## total.sulfur.dioxide -0.002224 0.000285 -7.81 6.7e-15 ***
## pH 0.273930 0.066893 4.10 4.3e-05 ***
## sulphates 0.620295 0.071304 8.70 < 2e-16 ***
## alcohol 0.344362 0.009197 37.44 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.734 on 5287 degrees of freedom
## Multiple R-squared: 0.305, Adjusted R-squared: 0.304
## F-statistic: 331 on 7 and 5287 DF, p-value: <2e-16
predicted_values <- predict(model1, newdata = numeric_data)
# Calculate residuals (the differences between predicted and actual values)
residuals <- predicted_values - numeric_data$quality
# Calculate Mean Squared Error (MSE)
mse <- mean(residuals^2)
print(sqrt(0.53))## [1] 0.728
## MSE 0.538
2. LOGISTIC REGRESSION
For Logistic Regression, we want the outcome variable to be categorical. Hence,w e are converting all the wines from 0 to 6 to “Bad Wines” and from 6-10 as “Good Wines”.
| type | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | quality_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | white | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.001 | 3.00 | 0.45 | 8.8 | 6 | 1 |
| 2 | white | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.994 | 3.30 | 0.49 | 9.5 | 6 | 1 |
| 3 | white | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.995 | 3.26 | 0.44 | 10.1 | 6 | 1 |
| 4 | white | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.996 | 3.19 | 0.40 | 9.9 | 6 | 1 |
| 7 | white | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30 | 136 | 0.995 | 3.18 | 0.47 | 9.6 | 6 | 1 |
| 10 | white | 8.1 | 0.22 | 0.43 | 1.5 | 0.044 | 28 | 129 | 0.994 | 3.22 | 0.45 | 11.0 | 6 | 1 |
winelogit1 <- glm(quality_category ~volatile.acidity+residual.sugar+pH+sulphates+alcohol+free.sulfur.dioxide+total.sulfur.dioxide, data = wine, binomial(link = "logit"))
summary(winelogit1)##
## Call:
## glm(formula = quality_category ~ volatile.acidity + residual.sugar +
## pH + sulphates + alcohol + free.sulfur.dioxide + total.sulfur.dioxide,
## family = binomial(link = "logit"), data = wine)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.17e+01 8.29e-01 -14.07 < 2e-16 ***
## volatile.acidity -4.28e+00 2.45e-01 -17.45 < 2e-16 ***
## residual.sugar 5.22e-02 8.64e-03 6.04 1.5e-09 ***
## pH 8.50e-01 2.24e-01 3.80 0.00015 ***
## sulphates 1.89e+00 2.48e-01 7.64 2.1e-14 ***
## alcohol 9.57e-01 3.71e-02 25.78 < 2e-16 ***
## free.sulfur.dioxide 1.84e-02 2.80e-03 6.59 4.5e-11 ***
## total.sulfur.dioxide -7.04e-03 9.27e-04 -7.59 3.1e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6999.2 on 5294 degrees of freedom
## Residual deviance: 5444.6 on 5287 degrees of freedom
## AIC: 5461
##
## Number of Fisher Scoring iterations: 4
## (Intercept) volatile.acidity residual.sugar
## 8.53e-06 1.39e-02 1.05e+00
## pH sulphates alcohol
## 2.34e+00 6.65e+00 2.60e+00
## free.sulfur.dioxide total.sulfur.dioxide
## 1.02e+00 9.93e-01
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 1185 | 794 | 1979 |
| Actual 1 | 530 | 2786 | 3316 |
| Total | 1715 | 3580 | 5295 |
#install.packages("pROC")
loadPkg(pROC) # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
wine$prob=predict(winelogit1, type = c("response")) # Add new column of predicted probabilities
h <- roc(quality_category~prob, data=wine)
auc(h) # area-under-curve prefer 0.8 or higher.Area under the curve: 0.808
#### Hosmer and Lemeshow test
########The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.
loadPkg(ResourceSelection) # function hoslem.test( ) for logit model evaluation
winelogit1_Hoslem = hoslem.test(wine$quality_category, fitted(winelogit1)) # Hosmer and Lemeshow test, a chi-squared test
unloadPkg(ResourceSelection)
winelogit1_HoslemHosmer and Lemeshow goodness of fit (GOF) test
data: wine$quality_category, fitted(winelogit1) X-squared = 11, df = 8, p-value = 0.2
# We can try this just to confirm the following calculation
wine_nulllogit1 <- glm(quality_category ~ 1, data = wine, family = "binomial")
mcFadden = 1 - logLik(winelogit1)/logLik(wine_nulllogit1)
mcFadden‘log Lik.’ 0.222 (df=8)
loadPkg(pscl) # use pR2( ) function to calculate McFadden statistics for model eval
wineLogit1pr2 = pR2(winelogit1)## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -2722.313 -3499.575 1554.526 0.222 0.254 0.347
true_positives <- 2786
true_negatives <- 1185
false_positives <-794
false_negatives <-530
total <- 5295
# Calculate accuracy
accuracy <- (true_positives + true_negatives) / total
cat("Accuracy of logistic regression model is", accuracy,"\n")Accuracy of logistic regression model is 0.75
precision <- true_positives / (true_positives + false_positives)
cat(" Precision of logistic regression model is", precision, "\n")Precision of logistic regression model is 0.778
3. KNN
loadPkg("class")
library(class)
set.seed(123)
sample_indices <- sample(1:nrow(wine), 0.7 * nrow(wine))
train_data <- wine[sample_indices, ]
test_data <- wine[-sample_indices, ]
library(class)
# Define predictors and response variable
predictors <- names(wine)[2:13] # Adjust column indices based on your data
response <- "quality"
train_data$quality_category <- as.factor(train_data$quality_category)
k <- 5 # You can adjust the value of k
knn_model <- knn(train = train_data[, predictors], test = test_data[, predictors], cl = train_data$quality_category, k = k)
confusion_matrix <- table(Actual = test_data$quality_category, Predicted = knn_model)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <-confusion_matrix[2,2]/sum(diag(confusion_matrix))
confusion_matrix## Predicted
## Actual 0 1
## 0 329 269
## 1 183 808
## Predicted
## Actual 0 1
## 0 329 269
## 1 183 808
## Accuracy: 0.716
## Precision : 0.711
## Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 2 2 ...
## [1] 1589
## knn_model
## 0 1
## 512 1077
library(caret)
set.seed(123)
cv <- trainControl(method = "cv", number = 10)
model <- train(quality_category ~ ., data = train_data, method = "knn", trControl = cv)
model## k-Nearest Neighbors
##
## 3706 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3334, 3335, 3335, 3336, 3336, 3335, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.712 0.361
## 7 0.711 0.357
## 9 0.706 0.340
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
4. DECISION TREES
library(rpart)
set.seed(123)
trainIndex <- sample(nrow(wine), 0.9 * nrow(wine))
trainData <- wine[trainIndex, ]
testData <- wine[-trainIndex, ]
tree_model <- rpart(quality ~ fixed.acidity + volatile.acidity + residual.sugar + free.sulfur.dioxide + density + pH + sulphates + alcohol, data = trainData)
print(summary(tree_model))## Call:
## rpart(formula = quality ~ fixed.acidity + volatile.acidity +
## residual.sugar + free.sulfur.dioxide + density + pH + sulphates +
## alcohol, data = trainData)
## n= 4765
##
## CP nsplit rel error xerror xstd
## 1 0.1665 0 1.000 1.000 0.0220
## 2 0.0338 1 0.834 0.844 0.0207
## 3 0.0321 2 0.800 0.811 0.0206
## 4 0.0136 3 0.768 0.789 0.0199
## 5 0.0100 4 0.754 0.783 0.0197
##
## Variable importance
## alcohol density volatile.acidity sulphates
## 49 25 12 5
## fixed.acidity residual.sugar free.sulfur.dioxide pH
## 4 2 2 1
##
## Node number 1: 4765 observations, complexity param=0.166
## mean=5.8, MSE=0.781
## left son=2 (2767 obs) right son=3 (1998 obs)
## Primary splits:
## alcohol < 10.6 to the left, improve=0.1660, (0 missing)
## density < 0.992 to the right, improve=0.1030, (0 missing)
## volatile.acidity < 0.458 to the right, improve=0.0519, (0 missing)
## free.sulfur.dioxide < 18.5 to the left, improve=0.0223, (0 missing)
## sulphates < 0.685 to the left, improve=0.0132, (0 missing)
## Surrogate splits:
## density < 0.993 to the right, agree=0.790, adj=0.500, (0 split)
## sulphates < 0.405 to the right, agree=0.607, adj=0.062, (0 split)
## fixed.acidity < 6.05 to the right, agree=0.603, adj=0.054, (0 split)
## residual.sugar < 5.28 to the right, agree=0.586, adj=0.014, (0 split)
## pH < 3.54 to the left, agree=0.583, adj=0.007, (0 split)
##
## Node number 2: 2767 observations, complexity param=0.0338
## mean=5.49, MSE=0.553
## left son=4 (1498 obs) right son=5 (1269 obs)
## Primary splits:
## volatile.acidity < 0.282 to the right, improve=0.08210, (0 missing)
## alcohol < 10.1 to the left, improve=0.03980, (0 missing)
## free.sulfur.dioxide < 24.5 to the left, improve=0.02160, (0 missing)
## density < 0.995 to the right, improve=0.00820, (0 missing)
## sulphates < 0.675 to the left, improve=0.00746, (0 missing)
## Surrogate splits:
## free.sulfur.dioxide < 29.5 to the left, agree=0.654, adj=0.246, (0 split)
## density < 0.995 to the right, agree=0.634, adj=0.203, (0 split)
## sulphates < 0.505 to the right, agree=0.628, adj=0.190, (0 split)
## residual.sugar < 4.15 to the left, agree=0.625, adj=0.182, (0 split)
## fixed.acidity < 6.85 to the right, agree=0.598, adj=0.124, (0 split)
##
## Node number 3: 1998 observations, complexity param=0.0321
## mean=6.22, MSE=0.786
## left son=6 (1100 obs) right son=7 (898 obs)
## Primary splits:
## alcohol < 11.7 to the left, improve=0.0760, (0 missing)
## free.sulfur.dioxide < 11.5 to the left, improve=0.0442, (0 missing)
## volatile.acidity < 0.587 to the right, improve=0.0405, (0 missing)
## density < 0.991 to the right, improve=0.0311, (0 missing)
## fixed.acidity < 7.05 to the right, improve=0.0117, (0 missing)
## Surrogate splits:
## density < 0.991 to the right, agree=0.712, adj=0.359, (0 split)
## volatile.acidity < 0.282 to the left, agree=0.582, adj=0.069, (0 split)
## sulphates < 0.365 to the right, agree=0.573, adj=0.050, (0 split)
## fixed.acidity < 5.95 to the right, agree=0.571, adj=0.046, (0 split)
## residual.sugar < 13.6 to the left, agree=0.553, adj=0.006, (0 split)
##
## Node number 4: 1498 observations
## mean=5.3, MSE=0.443
##
## Node number 5: 1269 observations
## mean=5.72, MSE=0.583
##
## Node number 6: 1100 observations, complexity param=0.0136
## mean=6, MSE=0.768
## left son=12 (174 obs) right son=13 (926 obs)
## Primary splits:
## volatile.acidity < 0.455 to the right, improve=0.0600, (0 missing)
## free.sulfur.dioxide < 11.5 to the left, improve=0.0533, (0 missing)
## pH < 3.5 to the right, improve=0.0157, (0 missing)
## sulphates < 0.675 to the left, improve=0.0154, (0 missing)
## fixed.acidity < 7.05 to the right, improve=0.0129, (0 missing)
## Surrogate splits:
## pH < 3.5 to the right, agree=0.853, adj=0.069, (0 split)
## density < 1 to the right, agree=0.847, adj=0.034, (0 split)
## fixed.acidity < 14.9 to the right, agree=0.844, adj=0.011, (0 split)
##
## Node number 7: 898 observations
## mean=6.49, MSE=0.675
##
## Node number 12: 174 observations
## mean=5.51, MSE=0.79
##
## Node number 13: 926 observations
## mean=6.09, MSE=0.709
##
## n= 4765
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 4765 3720 5.80
## 2) alcohol< 10.6 2767 1530 5.49
## 4) volatile.acidity>=0.282 1498 664 5.30 *
## 5) volatile.acidity< 0.282 1269 740 5.72 *
## 3) alcohol>=10.6 1998 1570 6.22
## 6) alcohol< 11.7 1100 845 6.00
## 12) volatile.acidity>=0.455 174 137 5.51 *
## 13) volatile.acidity< 0.455 926 657 6.09 *
## 7) alcohol>=11.7 898 606 6.49 *
The root node splits the full dataset based on the alcohol content of
the wines, with a split point of 10.6. Wines with alcohol < 10.6 go
to the left branch and wines with alcohol >= 10.6 go to the right
branch. The left branch then further splits on volatile acidity being
< or >= 0.282. This results in two leaf nodes predicting quality
scores of 5.30 and 5.72 for the low and high volatility acidity wines.
The right branch splits again on alcohol content < or >= 11.7. The
left sub-branch splits once more on volatile acidity to generate two
leaf nodes predicting quality of 5.51 and 6.09. The right-most leaf node
predicts a quality score of 6.49 for wines with the highest alcohol
content >= 11.7. So in total it has generated 5 leaf nodes predicting
quality scores ranging from 5.30 to 6.49 based on 3 split points on
alcohol content and 1 split point on volatile acidity.